ABSTRACT

Multidimensional synchronous dataflow (MDSDF) provides an effective
model of computation for a variety of multidimensional DSP systems
that have static dataflow structures. In this paper, we develop new
methods for optimized implementation of MDSDF graphs on embedded
platforms that employ multiple levels of parallelism to enhance
performance at different levels of granularity. Our approach allows
designers to systematically represent and transform multi-level
parallelism specifications from a common, MDSDF-based
application-level model. We demonstrate our methods with a case study
of image histogram implementation on a graphics processing unit (GPU).
Experimental results from this study show that our approach can be
used to derive fast GPU implementations, and enhance trade-off
analysis during design space exploration.

Index Terms: Dataflow graph, multidimensional synchronous dataflow,
graphics processing unit, integral histogram.

1.  INTRODUCTION 

Dataflow models are widely used for expressing the functionality of
digital signal processing (DSP) applications, such as those associated
with audio and video data stream processing, digital communications,
and image processing (e.g., see [1]). Dataflow provides a formal
mechanism for describing specifications of DSP applications, imposes
minimal data-dependency constraints in specifications, and is
effective in exposing and exploiting task- or data-level parallelism
for achieving high performance implementations. Synchronous dataflow
(SDF) [2] has been popular in the design of DSP applications because
of its useful features, including compile-time, formal validation of
deadlock-free operation and bounded buffer memory requirements, as
well as support for efficient scheduling and buffer size optimization
[1]. However, the SDF model is well suited only for one-dimensional
DSP algorithms, such as those in the domains of speech, audio, and
digital communication. Multidimensional synchronous dataflow (MDSDF)
[3] is a generalization of SDF to multiple dimensions. MDSDF provides
an effective model for a variety of multidimensional DSP systems that
have statically structured dataflow characteristics.

In this paper, we develop new methods for efficient implementation of
parallel processing solutions for signal processing systems using
MDSDF representations. Our proposed design methods apply dataflow
transformations to exploit data parallelism hierarchically for
multidimensional dataflow graphs, and they provide a systematic
approach for exposing and exploiting parallelism from multidimensional
dataflow specifications across different levels of the specification
hierarchy. We demonstrate our proposed new modeling techniques and
design methods by applying them to optimize implementations on the
NVIDIA graphics processing unit (GPU) programming model [4]. Using our
new MDSDF-based design techniques, we demonstrate efficient GPU
implementations for integral histogram computations, which form an
important class of image processing operations for surveillance and
monitoring applications. The results of our experiments demonstrate
concretely that our proposed design methods are effective in mapping
formal design models for multidimensional DSP systems into efficient
implementations on complex multicore processors.

2.  RELATED  WORK 

A variety of dataflow-based design tools has evolved in recent years
for design and implementation of signal processing systems (e.g., see
[1, 5, 6]). In this section, we summarize a number of recent efforts
beyond MDSDF (see Section 1) that have focused especially on
multidimensional dataflow modeling.

Keinert et al. propose an extension of MDSDF, called windowed
synchronous dataflow (WSDF) [7]. WSDF allows modeling of sliding
window algorithms for multidimensional applications. Array-OL [8] is a
language devoted to applications that involve multidimensional
intensive signal processing. Two levels of description are used for
modeling parallelism in Array-OL: the global model, which defines task
parallelism, and the local model, which expresses data parallelism.
Blocked dataflow (BLDF) [9] provides meta-modeling semantics that can
be used to represent block-based and multidimensional processing in
terms of different specialized dataflow models. BLDF provides a
unified framework that leads to efficient dataflow graph scheduling
and memory management.

McAllister et al. [10] augment the MDSDF model with parameterized
array expressions. Their modeling approach, called Multidimensional
Arrayed Synchronous Dataflow (MASD), provides graph range parameters
to control token dimensions at input and output ports, which enables
systematic trade-off exploration between actor network size and token
size.

The distinguishing contribution of this paper is that it presents a
novel design method, building on the MDSDF model of computation, for
hierarchical exploitation of parallelism in DSP applications. The
method developed in this paper helps to expose multidimensional
parallelism at different design levels in a platform-independent way,
and to exploit such parallelism using suitable platform-specific
mapping optimizations at the back-end of the design flow. Graph
clustering and MDSDF dataflow analysis are applied in novel ways to
provide a systematic approach for mapping applications to DSP
platforms that employ parallelism at multiple levels. In addition to
motivating and concretely illustrating our proposed design method, we
demonstrate its utility through a case study of an important,
practical multidimensional signal processing application.

3.  MODELING 

In this section, we present a structured design method based on MDSDF
graphs for hierarchical mapping of DSP systems onto parallel
architectures. In various forms of data parallel programming,
programmers can define functions, and have multiple calls to the
functions execute in parallel on different data sets (e.g., see
[4, 11]). Recent data parallel programming environments emphasize
support for exploiting multi-level or hierarchical parallelism, where
parallelism is exploited programmatically at multiple levels of
granularity. For example, CUDA [4] provides a two-level thread
hierarchy, where a set of threads makes up a thread block, and
multiple thread blocks form a grid.
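As a rough illustration of such a two-level hierarchy, the following Python sketch emulates sequentially how each thread of a 2-D grid of 2-D thread blocks obtains a distinct global coordinate, in the same way that a CUDA kernel typically combines its block index, block dimension, and thread index. The function name `global_thread_coords` and the example sizes are hypothetical and are not taken from the paper.

```python
from itertools import product


def global_thread_coords(grid_dim, block_dim):
    """List the global (x, y) coordinate of every thread in a 2-D grid of 2-D blocks.

    grid_dim  = (blocks along x, blocks along y)
    block_dim = (threads along x, threads along y)
    """
    grid_x, grid_y = grid_dim
    block_x, block_y = block_dim
    coords = []
    # Outer level of the hierarchy: thread blocks within the grid.
    for blk_y, blk_x in product(range(grid_y), range(grid_x)):
        # Inner level of the hierarchy: threads within one block.
        for thr_y, thr_x in product(range(block_y), range(block_x)):
            # Same arithmetic as blockIdx * blockDim + threadIdx in CUDA.
            coords.append((blk_x * block_x + thr_x, blk_y * block_y + thr_y))
    return coords
```

For instance, a (2 x 2) grid of (4 x 2) blocks covers an (8 x 4) index space, with every coordinate visited exactly once.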

Such hierarchical support for representing parallelism is important
for multidimensional signal processing applications, where parallelism
exists in different forms at different levels of the design hierarchy
(DH) (e.g., inter-frame, inter-block, and inter-pixel parallelism in
video processing). In this section, we build on the MDSDF model of
computation, and develop a design method to represent and apply
parallelism hierarchically for multidimensional dataflow graphs.

Fig. 1. An example of a three-actor MDSDF graph.

Let $G = (V, E)$ denote an MDSDF graph, where
$V = \{v_1, v_2, \ldots, v_{|V|}\}$ is a set of vertices (actors), and
$E = \{e_1, e_2, \ldots, e_{|E|}\}$ is a set of directed edges, which
represent communication between actors according to MDSDF semantics.
In MDSDF graphs, actor firings are indexed (in their associated
“firing spaces”) by $n$-dimensional vectors, where the values of $n$
depend on the dimensions of the data that are produced and consumed
($n = 1$ corresponds to conventional single-dimensional, SDF-like
firing sequences) [3].

Suppose that $v$ is an MDSDF actor with a firing space of $M$
dimensions, and let $r_{v,i}$, for $i = 1, 2, \ldots, M$, denote the
size of the $i$th dimension of the firing space for $v$ in a given
periodic schedule $S$ for $G$. A periodic schedule is a sequence of
actor firings that executes each actor at least once and produces no
net change in the numbers of tokens queued on the edges of $G$ [2, 3].
We refer to the $M$-vector
$\mathbf{r}_v = [r_{v,1}, r_{v,2}, \ldots, r_{v,M}]$ as the firing
vector for actor $v$ associated with $S$. The product of the $M$
elements of this vector gives the total number of firings of $v$
within $S$. For a properly constructed MDSDF graph, $\mathbf{r}_v$ can
be computed by solving a system of equations called the balance
equations for the graph [3].

Consider, for example, the 3-node graph illustrated in Fig. 1. The
firing vectors $\mathbf{r}_A$, $\mathbf{r}_B$, and $\mathbf{r}_C$ can
be found by solving the following balance equations for
$i = 1, 2, \ldots, M$:

$$ r_{A,i} O_{A,i} = r_{B,i} I_{B,i}, \qquad r_{B,i} O_{B,i} = r_{C,i} I_{C,i}, \tag{1} $$

where $\mathbf{I}_X = [I_{X,1}, I_{X,2}, \ldots, I_{X,M}]$ and
$\mathbf{O}_X = [O_{X,1}, O_{X,2}, \ldots, O_{X,M}]$ are the
$M$-dimensional consumption and production rates, respectively, for
actor $X$.
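To make the balance equations in (1) concrete, the following Python sketch solves them dimension by dimension for a chain of actors and returns the smallest positive integer firing vectors. It assumes a simple chain topology (A feeds B, B feeds C) with one edge between consecutive actors; the function name `chain_firing_vectors` and the example rates are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def chain_firing_vectors(actors):
    """Solve the balance equations for a chain of MDSDF actors.

    actors: list of (name, cons, prod) in chain order, where cons and prod
    are M-tuples of per-dimension rates. The consumption rate of the first
    actor and the production rate of the last actor are unused (pass None).
    Returns {name: firing vector as an M-tuple of minimal positive integers}.
    """
    num_dims = len(actors[0][2])
    per_dim = []
    for i in range(num_dims):
        # Fix the first actor's repetition count to 1 and propagate
        # r_next = r_prev * O_prev / I_next along the chain.
        reps = [Fraction(1)]
        for (_, _, prod_prev), (_, cons_next, _) in zip(actors, actors[1:]):
            reps.append(reps[-1] * prod_prev[i] / cons_next[i])
        # Scale by the least common multiple of the denominators so that
        # every repetition count becomes an integer.
        scale = reduce(lambda a, b: a * b // gcd(a, b),
                       (r.denominator for r in reps), 1)
        per_dim.append([int(r * scale) for r in reps])
    return {name: tuple(per_dim[i][k] for i in range(num_dims))
            for k, (name, _, _) in enumerate(actors)}
```

With hypothetical rates $\mathbf{O}_A = (3, 2)$, $\mathbf{I}_B = (2, 2)$, $\mathbf{O}_B = (1, 3)$, and $\mathbf{I}_C = (2, 1)$, the sketch yields $\mathbf{r}_A = (4, 1)$, $\mathbf{r}_B = (6, 1)$, and $\mathbf{r}_C = (3, 3)$, which can be checked against (1) in each dimension.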

Now suppose that we have an $N$-level hierarchical parallel
programming model (platform hierarchy) $P$, which we want to use to
implement a given MDSDF graph $G$. For example, such a parallel
programming model could be used as a target for code generation or
could be used for an implementation that is derived by hand based on a
functional reference (“golden model”) that is based on the MDSDF
specification. We develop an $N$-level hierarchical dataflow graph
transformation approach to achieve such a mapping from MDSDF to $P$.
We refer to $N$ in this context as the platform depth.

First, we introduce some definitions and notation related to
hierarchical dataflow graphs. For a dataflow graph $G = (V, E)$, let
$P_i(V)$ and $P_o(V)$ be the sets of input and output ports of all
actors in $V$, respectively. A supernode $s$ in $G$ is an actor (i.e.,
$s \in V$) that is associated with a “nested dataflow graph” $H(s)$,
where execution of $s$ in $G$ corresponds to execution of $H(s)$. In
general, not all actor ports in $H(s)$ are connected in $H(s)$ (i.e.,
not all of them connect to edges within $H(s)$). The “unconnected
actor ports” are referred to as the interface ports of $H(s)$, and
these ports are in one-to-one correspondence with ports of actor $s$.

If $G$ is the “top” of the DH (i.e., $G$ is not encapsulated by a
supernode in another graph), then we say that the nesting level (or
simply level) of $G$, denoted $\lambda(G)$, is 1. Similarly, for each
supernode $s$ in $G$, $\lambda(H(s)) = 2$; for each supernode $t$ in
any of these nested graphs $H(s)$, $\lambda(H(t)) = 3$, and so on.

The DHs in our model are non-overlapping, which means that for all
supernodes within a DH (i.e., across all levels), their corresponding
nested dataflow graphs do not share any actors or edges. Furthermore,
we assume that these DHs are finite, which means that the levels
($\lambda$ values) are all bounded.

We refer to the maximum $\lambda$ value in a DH $D$ as the depth
$\delta$ of $D$. For each $i \in \{1, 2, \ldots, \delta\}$, we denote
by $L_i$ the set of all actors that are “at level $i$”. That is,
$L_1 = V$, and for $i = 2, 3, \ldots, \delta$,

$$ L_i = \bigcup \{ V_{H(s)} \mid \lambda(H(s)) = i \}, \tag{2} $$

where $V_{H(s)}$ denotes the set of actors in the nested dataflow
graph $H(s)$.
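The level sets defined above can be computed with a simple traversal of the hierarchy. The following Python sketch is one possible way to do so; it assumes the DH is given as two dictionaries (graph name to actor list, and supernode to nested graph name), a representation that is hypothetical and chosen here only for illustration. Termination relies on the finite, non-overlapping property of DHs stated above.

```python
def level_sets(top, actors_of, nested_graph_of):
    """Group the actors of a design hierarchy by nesting level.

    top:             name of the top-level graph (nesting level 1)
    actors_of:       dict mapping a graph name to its list of actors
    nested_graph_of: dict mapping each supernode to the name of its
                     nested dataflow graph
    Returns {level: set of actors at that level}.
    """
    levels = {}
    pending = [(top, 1)]
    while pending:
        graph, level = pending.pop()
        for actor in actors_of[graph]:
            levels.setdefault(level, set()).add(actor)
            if actor in nested_graph_of:
                # The graph nested in a supernode sits one level deeper.
                pending.append((nested_graph_of[actor], level + 1))
    return levels


def depth(levels):
    """Depth of the design hierarchy: the maximum nesting level."""
    return max(levels)
```

A candidate DH could then be checked against the platform depth by comparing `depth(levels)` with the number of parallelism levels offered by the target.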

DHs in our decomposition approach can be constructed by designers as
they explore alternative methods to structure the hierarchies such
that they map efficiently into the parallelism hierarchy supported by
the targeted platform. The key constraint in construction of a DH $D$
is that the depth of each candidate DH should equal the platform
depth. In Section 4.3, we illustrate how a DH can be constructed
naturally from an understanding of the flowgraph structure of an
application. However, DHs can also be targeted by automated tools.
Exploration of such automated DH construction tools is a useful topic
for future work.

We have developed a systematic method, called multidimensional DH
mapping, to specify and map these DHs into hierarchies of smaller
graphs, which can be mapped to successively lower levels of the
targeted platform hierarchy. Fig. 2 illustrates this approach for an
MDSDF graph. The designer can construct the DHs bottom-up or top-down.
At each $i$th level ($i > 1$) of the DH, one or more groups (clusters)
of connected actors are combined into units that are viewed as
individual supernodes from level $(i - 1)$. Groups of actors,
including supernodes, that are contained within such clusters are then
scheduled together by adapting techniques for SDF- and MDSDF-based
clustered graph analysis and scheduling [12, 3]. Use of these
techniques to systematically derive production and consumption tuples
associated with actors at different levels of the design hierarchy, as
well as firing vectors, which determine the relative rates at which
different actors in a cluster execute, is illustrated in Fig. 2.

Fig. 2. An example of a DH for an MDSDF specification.

The actors labeled with the prefixes intin and intout in Fig. 2(c)
represent interface input and interface output actors that are
inserted based on the selected DH. These actors represent interfaces
to the enclosing supernodes and serve to inject data from input edges
and to output edges of the supernodes, while providing “standalone”
dataflow graph representations for each level of the DH. Using these
standalone representations, buffer management and scheduling are
performed to ensure correct, consistent execution while mapping the
actors in each DH level $L_i$ into the corresponding $i$th level of
the targeted DSP platform.

The production and consumption rates associated with the interface
input and interface output actors are derived systematically using the
cluster analysis techniques described above. Presently, we compute
these rates by hand, as our emphasis in this work is on demonstrating
the overall design methodology and its utility on a practical case
study. However, the process can readily be automated since it is based
on formal dataflow principles. Development of automated tool support
for the design methodology developed in this paper is a useful
direction for further work.

We omit further details of our multidimensional DH mapping approach in
this paper due to space limitations. We demonstrate the utility of the
approach in the next section with a case study of an important
multidimensional signal processing subsystem, image histogram
computation.


4.  CASE  STUDY 

To demonstrate our proposed method for mapping MDSDF design
hierarchies, we map an image processing application based on integral
histogram computation [13] onto a GPU target platform.

The integral histogram (IH) first maps pixels into a set of
non-overlapping ranges (“bins”), and then performs a 2-D scan. Two
scan orders, cross-weave and wavefront, are explored in [14]. The
cross-weave scan processes the image in the first dimension
(horizontal scan) followed by a scan in the second dimension (vertical
scan). Instead of applying two passes, the wavefront scan propagates
an anti-diagonal wavefront calculation as it operates through a single
scan.
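The cross-weave order can be sketched sequentially in a few lines of Python: a bin check produces one 0/1 plane per bin, and each plane is then scanned horizontally and vertically. This is only a CPU-side illustration under stated assumptions (half-open bin intervals given by `bin_edges`, and inclusive prefix sums); the function name and data layout are hypothetical, and the sketch is not the GPU implementation evaluated in this paper.

```python
def integral_histogram_cross_weave(image, bin_edges):
    """Compute an integral histogram with a cross-weave (row, then column) scan.

    image:     list of rows of pixel values
    bin_edges: ascending edges; bin k covers [bin_edges[k], bin_edges[k + 1])
    Returns ih with ih[k][y][x] = number of pixels of bin k in the rectangle
    spanning rows 0..y and columns 0..x (inclusive).
    """
    height, width = len(image), len(image[0])
    num_bins = len(bin_edges) - 1
    # Bin check: one 0/1 membership plane per bin.
    ih = [[[1 if bin_edges[k] <= image[y][x] < bin_edges[k + 1] else 0
            for x in range(width)]
           for y in range(height)]
          for k in range(num_bins)]
    for plane in ih:
        # Horizontal scan: running sum along each row.
        for y in range(height):
            for x in range(1, width):
                plane[y][x] += plane[y][x - 1]
        # Vertical scan: running sum along each column.
        for x in range(width):
            for y in range(1, height):
                plane[y][x] += plane[y - 1][x]
    return ih
```

In this formulation the planes of different bins are independent of one another, and within the horizontal (respectively vertical) pass the rows (respectively columns) are independent as well.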

In our experiments, we incorporate use of a tiled image processing
approach, where the image is separated into blocks (tiles) of
neighboring pixels. Tiled approaches can be useful for GPU
implementation to enhance parallel execution across multiple threads
[4]. In particular, we explore in this case study a tiled integral
histogram (TIH) approach for efficient mapping into GPU
implementations.

The overall input image size for IH computation is denoted as
$(I_w \times I_h)$ pixels, and the number of histogram bins is denoted
as $N_b$. In TIH computation, an image is tiled as an
$(N_w \times N_h)$ rectangular arrangement of tiles, where each tile
has a $(T_w \times T_h)$ rectangular arrangement of pixels. Here,
$T_w = I_w / N_w$ and $T_h = I_h / N_h$. For each $(T_w \times T_h)$
tile, the IH is calculated independently. After computation of all
$(N_w \times N_h)$ tile-level IHs, the results can be processed to
derive the image-level IH result.
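One way to see how tile-level results can be combined is the following Python sketch for a single bin plane. It first computes prefix sums that restart at every tile boundary, and then accumulates across tiles horizontally and vertically. The combination step shown here is an assumption made for illustration (the paper does not spell out this step at this point), and the function name `tiled_integral_plane` is hypothetical.

```python
def tiled_integral_plane(plane, tile_w, tile_h):
    """Integral image of one bin plane, computed tile by tile.

    plane:  list of rows of 0/1 bin-membership values (or any numbers)
    tile_w: tile width in pixels; must divide the image width
    tile_h: tile height in pixels; must divide the image height
    """
    height, width = len(plane), len(plane[0])
    assert width % tile_w == 0 and height % tile_h == 0
    out = [row[:] for row in plane]
    # Intra-tile scans: running sums that restart at each tile boundary.
    for y in range(height):
        for x in range(width):
            if x % tile_w:
                out[y][x] += out[y][x - 1]
    for y in range(height):
        for x in range(width):
            if y % tile_h:
                out[y][x] += out[y - 1][x]
    # Inter-tile horizontal accumulation: add the last column of the
    # (already accumulated) tile to the left.
    for y in range(height):
        for x in range(tile_w, width):
            out[y][x] += out[y][(x // tile_w) * tile_w - 1]
    # Inter-tile vertical accumulation: add the last row of the
    # (already accumulated) tile above.
    for y in range(tile_h, height):
        for x in range(width):
            out[y][x] += out[(y // tile_h) * tile_h - 1][x]
    return out
```

The intra-tile scans touch each tile independently, which is the property that a tiled GPU mapping can exploit; only the two accumulation passes introduce dependencies between tiles.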

We experiment with both tiled and non-tiled versions for the
cross-weave scan. We have observed that non-tiled configurations of
our wavefront-based IH actor perform with unacceptable latency on the
targeted GPU, and therefore, we employ only tiled configurations when
using the wavefront scan.

4.1.  Actor  Design 

For GPU-based implementation of IH computation, we design three types
of two-dimensional signal processing actors. These actors are
parameterized so that they can be statically or dynamically configured
(e.g., using parameterized dataflow [15] integration with MDSDF) for
the desired type of IH computation. This parameterization in
conjunction with our multidimensional DH mapping approach helps
designers to explore trade-offs involving different IH computation
strategies in conjunction with efficient parallel realizations of
these strategies.

Each of the three actors employed in our IH case study has a single
input port and a single output port. These actors are described as
follows.

First, the Bin-Check actor determines bin membership for pixels. The
actor executes pixel checks of an image column for all bins with
CONS $= (1, I_h)$ and PROD $= (1, I_h \times N_b)$. Here, and in the
remainder of this section, we denote the two-dimensional (MDSDF)
production and consumption rates of a given actor port as PROD and
CONS, respectively.

Second, the Intra-Tile-IH actor computes the IH, where the size of the
input tile is specified by the actor parameters $T_w$ (width) and
$T_h$ (height), and the scan order is specified by the scan order
parameter of the actor. The supported settings for the scan order
parameter are:


• CWS: Compute the IH using a cross-weave scan with tiling. The actor
ports satisfy CONS = PROD $= (T_w, T_h)$.

• WFS: Compute the IH using a wavefront scan with tiling. The ports
again satisfy CONS = PROD $= (T_w, T_h)$.

• NT: Compute the IH using a cross-weave scan without tiling; that is,
calculate the IH for the input image directly with
CONS = PROD $= (T_w, T_h)$.

Third, the Inter-Tile-IH actor performs accumulation among tiles with
a parameter, called the accumulation order parameter, to support
different scan orders for performing the accumulation. In particular,
horizontal, vertical, and wavefront scans are used for accumulation
order settings that are denoted HS, VS, and WFS, respectively. The
ports of this actor (regardless of the accumulation order setting)
satisfy CONS = PROD $= (I_w, I_h)$. In addition, the accumulation
order parameter can be set to the value IDLE to bypass any
accumulation. While in the IDLE configuration, the actor performs no
computation, and simply passes its input to its output (through a
simple pointer transfer to avoid memory transfer overhead).

Fig. 3. MDSDF graph for optionally-tiled IH computation.

Table 1. Application modes.

App mode    Method             V2 SOP    V3 SOP    V4 SOP
APP-CWS     cross-weave TIH    CWS       HS        VS
APP-WFS     wavefront TIH      WFS       WFS       IDLE
APP-NT      no tiling          NT        IDLE      IDLE


4.2.  Application  Graph 

Given  the  actors  developed  in  Section  4.1,  one  can  implement 
the  IH  application  with  the  MDSDF  graph  shown  in  Fig.  3. 
The  desired  scan  orders  and  tiling  settings  can  be  achieved 
by  setting  the  actor  parameter  values  appropriately.  In  the 
experiments,  we  show  performance  comparisons  among  three 
specific  application  modes,  which  are  defined  by  the  groups 
of  parameter  settings  shown  in  Table  1.  Here,  SOP  stands  for 
“scan  order  parameter.” 


4.3.  DH  Exploration 

We customize the implementations for the different application modes
by examining their MDSDF application graph representations separately,
and deriving separate DHs to guide the application mapping process.
Taking the application mode labeled APP-CWS as an example, we show a
DH in Fig. 4 that can be used to derive an efficient implementation on
the targeted GPU. In the grid level of target platform parallelism,
which is illustrated in Fig. 4(a), the 2-D indices shown above the
actors represent the corresponding firing vectors that are derived
from the DH (see Section 3). Each actor in the top level of the DH is
mapped to a kernel function in the GPU, and the firing vector is used
to configure the grid size.

Fig. 4. Hierarchical dataflow graphs for cross-weave TIH.

Fig. 4(b) depicts the second level (i.e., block level) for the
Intra-Tile-IH actor, in the form of a hierarchical dataflow subgraph
that specifies the internal functionality of this actor. To avoid
non-coalesced memory access, the input data is loaded and transposed
in the shared memory by the G-to-S Loader actor before the horizontal
scan (G-to-S stands for “global-to-shared”). After the scan for each
data row, the results are transferred from the shared memory back to
the global memory by the S-to-G (“shared-to-global”) Loader actor.
Finally, a vertical scan is performed to obtain the IH for the input
tile.

4.4.  Experiments 

In our experiments, an NVIDIA GTX260 GPU and an Intel Xeon 3 GHz CPU
are used. We compare the three different application modes in Table 1.
Table 2 lists the grid and block sizes for GPU kernels. Performance is
compared for four image sizes $(I_w \times I_h)$: 32x32, 64x64,
256x256, and 512x512. Based on the number of GPU threads employed for
each kernel, we choose a tile size of (32 x 16) in the APP-CWS mode
for all image sizes. For the APP-WFS mode, tile sizes of (4 x 4),
(8 x 8), (16 x 8), and (32 x 16) are chosen for successively larger
image sizes. We evaluate the frame processing time, including the time
required for memory transfer from the host to the device (GPU) and the
processing time on the device. We do not include the time for memory
transfer from the device back to the host because many applications
that employ IH can be implemented on the GPU efficiently without need
for data transfer back to the CPU.

Table 2. Grid sizes (upper) and block sizes (lower) derived from DH in
our experiments.

mode        V2 kernel           V3 kernel        V4 kernel
APP-CWS     (N_w, N_h N_b)      (1, N_b)         (1, N_b)
            (T_w, 1)            (T_w, T_h)       (T_w, T_h)
APP-WFS     (1, N_b)            (1, N_b)         N/A
            (N_w, N_h)          (T_w, T_h)
APP-NT      (1, N_b)            N/A              N/A
            (W)

Fig. 5 shows the frame rates (i.e., 1/r, where r represents the
average time in seconds required to process a single frame) for
various bin sizes ranging from 16 to 1024. From the experimental
results, we see that the GPU implementation of the IH consistently
outperforms the CPU implementation, and that the speedup gains are
approximately 35X for image sizes 32x32 and 64x64, 67X for image size
256x256, and 75X for image size 512x512.

Among the different GPU implementations for the 32x32 image size case,
IH without tiling (APP-NT) provides the best performance since it
avoids overhead from tiling. In the 64x64 case, however, APP-NT
suffers from reduced inter-thread parallelism due to the large amount
of shared memory required. The best performance is achieved in the
APP-WFS mode, as this mode provides more threads in the V2 kernel and
less overhead due to tiling (V4 is bypassed). With image sizes of
256x256 and 512x512, we must use tiling due to the size limitations of
the shared memory. Compared to APP-WFS, APP-CWS can offer better frame
rates by providing more effective parallel execution on the target
platform.

In summary, the best application mode for IH calculation depends on
the image size. Parameterized MDSDF application modeling, in
conjunction with our multidimensional DH mapping approach, is
therefore a useful design method for mapping IH computations
systematically onto the targeted GPU platform. Such a systematic
mapping approach leads to designs that can be mapped more efficiently,
and that are more portable and easier to maintain and extend.

5.  CONCLUSION 

In this paper, we have developed a novel design method, building on
the MDSDF model of computation, for hierarchical exploitation of
parallelism in multidimensional signal processing applications. This
method allows designers to explore alternative implementations in a
manner that separates platform-specific parallel processing
optimization from the behavioral specification, thereby enhancing
portability and trade-off exploration. More specifically, our
multidimensional design hierarchy model serves as an intermediate
model that provides a formal linkage between hierarchical layers of
parallelism in the target platform and corresponding subsystems of the
application that will be mapped onto these layers. In our approach,
graph clustering and MDSDF dataflow analysis are applied in novel ways
to map applications to target platforms that employ parallelism at
multiple levels. Experimental results show that the approach can be
used to derive fast GPU implementations, and that it supports
efficient trade-off analysis and optimization across different
application modes.

Fig. 5. Performance comparisons for different image sizes.